Hantology-A Linguistic Resource for Chinese Language Processing and Studying
Abstract
Hantology, a character-based Chinese language resource, was created to provide an infrastructure for language processing and for research on the writing system. Unlike alphabetic or syllabic writing systems, the ideographic writing system of Chinese poses both a challenge and an opportunity. The challenge is that a totally different resource structure must be created to represent and process speakers' conventionalization of the language. The rare opportunity is that the structure itself is enriched with conceptual classification and can be utilized for ontology building. We describe the contents and possible applications of Hantology in this paper. The applications of Hantology include: (1) an account of the diachronic development of Chinese lexica, (2) character-based language processing, (3) a study of conceptual structure differences between Chinese and English, and (4) comparisons of different ideographic writing systems.
Introduction
WordNet has been adopted by many researchers to solve natural language processing problems (Fellbaum, 1998). Several Chinese WordNets based on WordNet have also been developed recently. However, these Chinese WordNets cannot describe the features of Chinese characters (Huang et al., 2004; Chen et al.; Chang, 2003). In addition, many problems that rarely exist in the processing of other languages cannot be solved by Chinese WordNets. A typical problem in processing Chinese text is the missing-character problem: some characters have not been encoded in computer systems (Chou, 2005). Another problem is that computers cannot process variants correctly. Chinese characters have many variants, which are different glyphs representing the same word or morpheme. For instance, the characters ‘体’ and ‘體’ represent the same word and morpheme with different glyphs; they are variants and can replace each other. These problems also cause difficulties in information retrieval and interchange.
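To make the retrieval problem concrete, the following is a minimal Python sketch, not part of Hantology itself. It assumes a hypothetical hand-written table `VARIANT_TO_CANONICAL` that maps a variant glyph to one chosen canonical glyph; only the pair ‘体’ and ‘體’ from the text is included, and the helper names `normalize` and `variant_match` are illustrative.

```python
# Hypothetical variant table (an assumption for illustration only):
# each variant glyph is mapped to one canonical glyph of the same word.
VARIANT_TO_CANONICAL = {
    "体": "體",
}


def normalize(text):
    """Replace every known variant glyph with its canonical glyph."""
    return "".join(VARIANT_TO_CANONICAL.get(ch, ch) for ch in text)


def variant_match(query, document):
    """Substring search that treats variant glyphs as equivalent."""
    return normalize(query) in normalize(document)


if __name__ == "__main__":
    document = "保持身體健康"
    query = "身体"
    # A plain substring test compares code points and misses the variant.
    print(query in document)
    # Normalizing both sides first lets the variant match.
    print(variant_match(query, document))
```

A plain string comparison treats the two glyphs as unrelated code points, which is why a resource that records variant relations is needed before retrieval can match them.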
WordNet does not represent the relations among variants because WordNet is founded on synonyms, and synonyms differ from variants. Synonyms are different words with the same meaning; variants are different forms of the same word. Sentences, phrases, words, and characters are all different forms. For computer systems, it is very important to know the meaning and concept carried by each of these forms. In English, each character is just a writing unit that carries no concept, so there is no need to build resources for alphabetic characters. However, each Chinese character is not only a writing unit but also a concept unit. Because there are many relations among concepts, the characters are not independent of each other.
Knowledge of languages represented in computers can not only solve natural language processing problems but also provide resources for language research, so language resources are critical to computational linguistics. However, researchers who study Chinese characters lack resources and have difficulty benefiting from computers. Although several Chinese character databases have been created, these databases focus only on the glyphs or pronunciations of characters and are difficult to share with other applications. The purpose of this paper is to introduce Hantology and its applications for computer systems and researchers.
Related Work
There are several studies on the creation of Chinese character databases. One important study is the Chinese glyph expression database, which consists of 59000 glyph structures (Juang & Hsieh, 2005). The glyphs of Chinese characters are decomposed into 4766 basic components, and each Chinese character can be expressed by these basic components. The Chinese glyph database also contains oracle bone, bronze, greater seal, and lesser seal scripts.
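The idea of expressing a character through basic components can be sketched in a few lines of Python. This is only an illustration: the table `COMPONENTS` and the function `decompose` are hypothetical, the entries are common textbook decompositions rather than data from the glyph expression database, and any character missing from the table is simply treated as basic.

```python
# Illustrative decomposition table (an assumption, not data from the
# Chinese glyph expression database). Characters absent from the table
# are treated as basic components in this sketch.
COMPONENTS = {
    "體": ("骨", "豊"),
    "体": ("亻", "本"),
    "簡": ("⺮", "間"),
    "間": ("門", "日"),
}


def decompose(char):
    """Recursively expand a character into its basic components."""
    parts = COMPONENTS.get(char)
    if parts is None:
        return [char]  # no entry: already a basic component here
    result = []
    for part in parts:
        result.extend(decompose(part))
    return result


if __name__ == "__main__":
    for ch in "體体簡":
        print(ch, decompose(ch))
```

The sketch ignores the spatial arrangement of components (left-right, top-bottom, enclosure), which a real glyph structure would also need to record.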
The largest Chinese character database is the Mojikyo font database, which contains more than 110000 characters (Ishikawa, 1999). Both the Chinese glyph expression database and the Mojikyo font database contain only glyph knowledge. Yung created an ancient pronunciation database for Chinese characters (Yung, 2003). Hsieh proposed HanziNet, which represents characters by a 16-bit binary code (Hsieh, 2005). Chinese characters are classified into hierarchical categories, and HanziNet can describe the upper-layer concept of a character. These previous studies concentrated on only one dimension of Chinese characters. However, each Chinese character has glyph, script, pronunciation, sense, and variant dimensions. The previous studies therefore cannot provide enough knowledge for computer applications and researchers. Chou and Huang propose an ontology named Hantology to provide the glyphs, scripts, pronunciations, senses, and variants of Chinese characters (Chou, 2005; Chou & Huang, 2005). The